System and method for data organization, optimization and analytics

ABSTRACT

A system and method for data organization, optimization and analytics includes a web server, thrift server, distributed processing framework, key value store, distributed file system, and relational database. The web server provides a method whereby users issue control actions and query for records via interaction with the thrift server. The thrift server is the center of coordination and communication for the system and interacts with other system elements. The key value store organizes all of the operational data for the system. The key value store runs on a highly scalable distributed system, including a distributed file system for storage of data on disk. The distributed processing framework enables data to be processed in bulk and is used to execute analytical processing on the data. The relational database holds all of the administrative data in the system. Search queries are submitted by an end user and results of the search query are sent from the web server to the end user. The web server sends control actions to queue background map reduce jobs. These jobs run in the distributed processing framework and are used to write data and indexes and execute bulk analytics against the key value store.

COPYRIGHT NOTICE

This disclosure is protected under United States and/or International Copyright Laws. © 2020 Koverse, Inc. All Rights Reserved. A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and/or Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

PRIORITY CLAIM

The present application is a continuation of U.S. application Ser. No. 16/746,445 filed Jan. 17, 2020; which application is a continuation of U.S. application Ser. No. 14/738,649 filed Jun. 12, 2015; which application claims priority from U.S. Provisional Patent Application Ser. No. 62/012,201 filed Jun. 13, 2014, which is incorporated by reference as if fully set forth herein.

FIELD OF THE INVENTION

This invention is in the field of data storage and processing for the purposes of deriving insights and intelligence from that data.

BACKGROUND OF THE INVENTION

In data management, a topic of keen interest to many organizations is how to effectively develop and utilize analytics to impact any and all aspects of business. The most important factor in the value provided by an analytic is not the sophistication, scale or accuracy of the insights provided, but how successfully analytics are integrated into the mission of the organization.

If an organization is going to constantly advance the way that it uses data and analytics, all parties involved in the creation and usage of analytics (Data Engineers, Data Scientists, Business Analysts and Business Consumers) may work as an integrated team, constantly evolving techniques and procedures to better leverage their data for their business.

BRIEF DESCRIPTION OF THE DRAWINGS

Preferred and alternative examples of the present invention are described in detail below with reference to the following drawings:

FIG. 1 is an architecture diagram showing the relationship between various components of the system, including key value store, web server and relational database;

FIG. 2 is a schematic diagram of an index table;

FIG. 3 is a schematic diagram showing the relationship between four tables in the key value store and the processes that maintain them;

FIG. 4 is a schematic diagram of the record table of FIG. 3;

FIG. 5 is a flowchart of a notional example of an analytic workflow;

FIG. 6 is a schematic diagram of a pluggable transform framework;

FIG. 7 is a system flowchart illustrating the security architecture that enforces both mandatory and role-based access control;

FIG. 8 is a system flowchart illustrating a data ingest and sampling architecture; and,

FIG. 9 is a flow chart illustrating various transform steps.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

This patent application describes one or more embodiments of the present invention. It is to be understood that the use of absolute terms, such as “must,” “will,” and the like, as well as specific quantities, is to be construed as being applicable to one or more of such embodiments, but not necessarily to all such embodiments. As such, embodiments of the invention may omit, or include a modification of, one or more features or functionalities described in the context of such absolute terms.

Embodiments of the invention may be operational with numerous general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

Embodiments of the invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer and/or by computer-readable media on which such instructions or modules can be stored. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

Embodiments of the invention may include or be implemented in a variety of computer readable media. Computer readable media can be any available media that can be accessed by a computer and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

According to one or more embodiments, the combination of software or computer-executable instructions with a computer-readable medium results in the creation of a machine or apparatus. Similarly, the execution of software or computer-executable instructions by a processing device results in the creation of a machine or apparatus, which may be distinguishable from the processing device, itself, according to an embodiment.

Correspondingly, it is to be understood that a computer-readable medium is transformed by storing software or computer-executable instructions thereon. Likewise, a processing device is transformed in the course of executing software or computer-executable instructions. Additionally, it is to be understood that a first set of data input to a processing device during, or otherwise in association with, the execution of software or computer-executable instructions by the processing device is transformed into a second set of data as a consequence of such execution. This second data set may subsequently be stored, displayed, or otherwise communicated. Such transformation, alluded to in each of the above examples, may be a consequence of, or otherwise involve, the physical alteration of portions of a computer-readable medium. Such transformation, alluded to in each of the above examples, may also be a consequence of, or otherwise involve, the physical alteration of, for example, the states of registers and/or counters associated with a processing device during execution of software or computer-executable instructions by the processing device.

As used herein, a process that is performed “automatically” may mean that the process is performed as a result of machine-executed instructions and does not, other than the establishment of user preferences, require manual effort.

Key to an effective data team is its ability to lower the time to market for new analytics and enhancements. The primary advantage of this invention is that, through all of its integrated features, it dramatically lowers the time to market for new analytics. Those key features are:

Modular and Reusable: data sources, data sets, data flows, analytics and applications are modular such that they can easily be repurposed or modified to solve new use cases.

Adaptive Schemas: ingesting new datasets into a data warehouse or similar analytic platform requires specification of the data schema up front. This invention uses an adaptive schema architecture and is hence able to efficiently ingest new data and adapt to its schema, resulting in little or no up-front work.

Data Discovery: knowing what data is relevant to a given analytic need can sometimes be a challenge. Frequently data is locked away in siloed databases, making it impossible to determine which datasets contain information relevant to a given problem. Within this system, all data is indexed by default, and as a result, the data discovery process only takes a few minutes via simple interactive queries to identify the specific data records, documents or datasets most relevant to the problem at hand. This, combined with the adaptive schema architecture, results in a very powerful capability.

Prototype to Product: once an operationally useful analytic is developed, it is frequently the case that the analytic needs to be productized. Workflows need to be formalized, access controls and auditing functionality need to be added, and the analytic needs to be scaled to support the end users' daily needs. This invention provides all of these functions automatically to all data and analytics implemented within it; as such, operationalizing an experimental analytic is a trivial matter.

To appreciate the impact of this invention, one must consider the roles involved in integrating data into business:

Data Engineers: know and are able to access and retrieve data from systems and data sources throughout the organization.

Data Scientists: have deep knowledge of data, algorithms, statistics and computational methods. They develop and apply statistical models and algorithms on the data. They work with data engineers to clean and unify data and produce summary datasets and statistical models for use by the Business Analysts and Consumers.

Business Analysts: know the data in detail and the business needs of the organization. They leverage a range of visualization and business intelligence tools to develop business intelligence reporting. This invention provides them quick access to a range of datasets, integrated into their existing tools. They produce analytic reports for use throughout the organization.

Business Consumers: are drawn from all parts and levels of the business. They benefit from quick, easy queries on raw data in the lake and also from pre-computed insights embedded into existing and new applications. Business Consumers provide feedback to the rest of the data team on the usefulness of analytics and suggest improvements to them.

Working as a team, via constant interaction, these four roles can address a myriad of business needs, leveraging any and all data and integrating it into business processes, and by using this invention they can do it with a development cycle of days instead of months. A discussion of how the data is organized in the key-value system for efficient batch updates of derivative data sets follows.

A preferred embodiment of the present invention is directed to a data storage, processing and hosting system for the rapid prototyping and operationalization of analytics to support multiple use cases across multiple datasets within a single system.

Turning now to FIG. 1, an architecture diagram illustrates the relationship between components of a system 100 and method according to a preferred embodiment of the present invention: namely, a web server 1, thrift server 2, distributed processing framework 51, key value store 52, distributed file system 53, and relational database 3 (also referred to herein as management database). Web server 1 provides a method whereby users issue control actions 54 and query for records via interaction with the thrift server 2. The thrift server 2 is the center of coordination and communication for the system 100 and interacts with other system elements. The key value store 52 organizes all of the operational data for the system 100. The key value store 52 is intended to run on a highly scalable distributed system, including a distributed file system 53 for storage of data on disk, and to allow large scale data to be stored, managed and hosted in the file system. The distributed processing framework 51 enables data to be processed in bulk and is used to execute analytical processing 55 on the data. The relational database 3 holds all of the administrative data in the system 100, including logins, audit information, and job scheduling data. Search queries 16 are submitted by end user 17 and are expected to be answered by the web server 1, preferably in a few seconds. Results 18 of the search query 16 are sent from the web server 1 to the end user 17 and are originally drawn from the key value store 52 by the web server 1 via the thrift server 2. The web server 1 sends control actions 54, via RPC calls to the thrift server 2, to queue 56 background map reduce jobs 55. These jobs run in the distributed processing framework 51 and are used to write data and indexes, and execute transforms against the key value store 52.

FIG. 2 illustrates index table 12, one of the four tables in the key value store 52. These tables store pre-computed information about original records and are used for quickly retrieving records according to specific search criteria, for presenting users with a summary of the data in a collection, and for allowing users to retrieve a sample of the data in a collection. Index table 12 stores all of the indexes, both composite and regular. FIG. 3 shows the relationship between the four tables in the key value store and the processes that maintain them. The primary table is the record table 11, which contains a serialized raw record of data in the system. A map reduce stats job 8 processes data within the record table 11 and populates stats table 13 with the result. A map reduce indexing job 10 processes data in the record table 11 and populates the index table 12. A composite index map reduce job 9 processes data in the record table 11 and populates composite index 14. Finally, transforms 15 process data in the record table 11 and write the results back to the record table 11.

As illustrated in FIGS. 2 and 3, the record table 11 is the table that enables all processing of operational data. By comparison, the other tables 12, 13, 14 enable search and data assessment. In looking at how data is stored in the record table 11, storing records on disk in time order is desirable so that a new set of records can be processed by processes interested in updating derivative data structures such as indexes and analytical result sets. However, in a distributed key-value system with sorted keys, time-ordered data forces new records to go to one machine, eliminating parallelism in the system. Using an appropriate key structure, consisting of bucket IDs added to the beginning of the key together with batch IDs, the system can store new records on multiple machines in parallel, and read a new set of records in parallel for the purposes of updating downstream data structures. This results in very high performance for both query and bulk analytics.

The record table 11, as illustrated in FIG. 4, has four elements in its key. These elements enable efficient operation of various aspects of the system. These elements include bucket ID 19, collection ID 20, batch ID 21 and uniqueness 22.

The process for storing records in the record table is designed to support both interactive retrieval of a set of records identified as meeting the criteria of a specific search, as well as efficient bulk processing of data in batches across time, to support analytics. This organization involves distributing records into ‘buckets’ that allow a natural distribution of buckets of records across multiple servers, thereby providing parallelism when processing data. To create the record table 11, each new record is assigned a bucket ID 19 using a round-robin process (i.e. assigning records to buckets in order, from 1 to N where N is the number of buckets, and starting over at 1, and so on) and also a batch ID 21 that is computed as a timestamp divided by a given batch duration (for example, some number of seconds). The full record format is shown in FIG. 4, but the operative elements are the bucket ID, which causes records to be stored uniformly throughout the record table 11, and the batch ID, which causes records to be stored in a continuous block of records within each bucket, because the underlying sorted key-value store organizes records into blocks via the sorting mechanism. The collection ID 20 is assigned when the collection is created and a collection name is provided by the user.
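The key assignment described above can be sketched as follows. This is a minimal illustration only, not the patented implementation: the class name `RecordKeyBuilder`, the `|` separator, the zero-padded field widths, and the use of a running counter for the uniqueness element are all assumptions made for the example.

```python
import itertools


class RecordKeyBuilder:
    """Builds keys shaped as bucket ID | collection ID | batch ID | uniqueness.

    The separator, padding widths and counter-based uniqueness element are
    hypothetical choices for this sketch.
    """

    def __init__(self, num_buckets: int, batch_seconds: int):
        self.batch_seconds = batch_seconds
        # Round-robin over bucket IDs 1..N, starting over at 1.
        self._buckets = itertools.cycle(range(1, num_buckets + 1))
        self._counter = itertools.count()

    def key_for(self, collection_id: str, timestamp: int) -> str:
        bucket_id = next(self._buckets)
        # Batch ID: timestamp divided by the batch duration.
        batch_id = timestamp // self.batch_seconds
        uniqueness = next(self._counter)
        # Zero padding keeps lexicographic order equal to numeric order.
        return f"{bucket_id:04d}|{collection_id}|{batch_id:012d}|{uniqueness:012d}"


builder = RecordKeyBuilder(num_buckets=3, batch_seconds=60)
keys = [builder.key_for("c1", 120 + i) for i in range(4)]
```

With three buckets, the four keys land in buckets 1, 2, 3 and 1 again, and all share batch ID 2, so a sorted store would hold each bucket's share of the batch as one contiguous run.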

When follow-on processes are directed to read over records, they are provided the list of bucket IDs and a set of batch IDs to process. A distributed job coordinator creates N independent processes (workers), each of which is assigned one of the bucket IDs 19 and a batch ID 21 from the lists. Each of these workers then accesses a continuous block of records from the underlying key-value store using sequential disk access, which provides high throughput record reads. This high throughput is the result of each block of records being located on separate machines, in contiguous blocks, so that data on the individual servers is accessed simultaneously using sequential read patterns, which are much more efficient than random access patterns for conventional hard disk drives.
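A rough sketch of this read pattern is given below. It assumes, purely for illustration, that the store is an in-memory sorted list of string keys shaped as bucket | collection | batch | uniqueness, and it uses a thread pool as a stand-in for the distributed workers; the function names and key layout are hypothetical.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def scan_prefix(sorted_keys, prefix):
    """Return the contiguous run of keys that start with `prefix`."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


def read_batches(sorted_keys, collection_id, bucket_ids, batch_ids):
    """Assign one worker per (bucket ID, batch ID) pair; each worker performs
    a single sequential range scan over its own block of keys."""
    prefixes = [
        f"{bucket:04d}|{collection_id}|{batch:012d}|"
        for bucket, batch in product(bucket_ids, batch_ids)
    ]
    with ThreadPoolExecutor(max_workers=max(1, len(prefixes))) as pool:
        blocks = list(pool.map(lambda p: scan_prefix(sorted_keys, p), prefixes))
    return dict(zip(prefixes, blocks))


store = sorted([
    "0001|c1|000000000002|000000000000",
    "0002|c1|000000000002|000000000001",
    "0001|c1|000000000003|000000000002",
    "0001|c2|000000000002|000000000003",
])
result = read_batches(store, "c1", bucket_ids=[1, 2], batch_ids=[2])
```

Because every worker touches a disjoint, contiguous key range, no coordination between workers is needed while reading.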

Often, in order to answer an analytical question, one or more collections must be processed via bulk analytics (i.e. transforms) that combine, summarize, filter, or aggregate information in original collections. FIG. 5 shows a notional workflow containing two input sources 26, 27, two collections containing raw data 23, 25, three bulk analytics 27, 28, 30, two of which are running on raw data 27, 28 and one of which is running on the output of another analytic (30), and two collections containing analytic results 24, 29.

Transforms enable bulk analytical workflows and rely on a system that automatically distributes data across multiple servers and allows for code to be executed on those servers. The code processes the data local to each server. Some key-value stores 52 can be used as the underlying system that feeds data to a MapReduce job, if they allow the local execution of code applied to some partition of the data each server manages. This system builds on that capability to enable analytical workflows, that is, the chaining of multiple MapReduce jobs into a logical analytic workflow called a transform. In a transform, each individual MapReduce job executes on the servers participating in a distributed key-value store, the output of each intermediate MapReduce job is written to either the key-value store or to a distributed file system, and the output of the final MapReduce job is written to the key-value store. The output of each transform is written to a collection. Multiple transforms can be chained together to create more complex workflows. This has the advantage of allowing the input records and output records of each analytical workflow to be queried in-situ rather than needing to be made searchable by copying data into a separate query system. An example of a complex analytic workflow is shown in FIG. 5, where elements 27, 28, 30 are transforms and elements 23, 25, 24, and 29 are collections that can be queried in-situ.

In accordance with further embodiments of the present invention, a transform 61, consisting of one or more MapReduce (also referred to herein as “transform”) jobs 60, may be defined by a developer in, for example, a Java class containing a list of MapReduce ‘stages’ as shown in FIG. 9. Each stage is preferably of a particular type, either a Map Stage 57, Combine Stage 58, or Reduce Stage 59. Stages may be further differentiated by whether they are the initial stage in a chain, an intermediate stage, or the final stage. Each stage contains the definition of one method, either a map() method, a combine() method, or a reduce() method, which operates on Record representations containing an association of fields to values, or on simple key-value pairs.

The system executes a transform by bundling stages into individual MapReduce jobs 60 (FIG. 9), which can consist of either a single map stage, map and combine stages, map and reduce stages, or all three map, combine, and reduce stages. When there is no map stage, an identity map stage is used to simply pass unmodified records on to either the initial combine stage or reduce stage. A transform's stages may end up being bundled into multiple MapReduce jobs 60, but each stage belongs to only one MapReduce job. The system takes care of configuring each job to retrieve its input from either the key-value store, or from a distributed filesystem storing the results of a previous job, and also configuring the output to go to either the distributed filesystem for processing by a later job or to the key-value store for storing final results.
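One way the bundling of stages into jobs might work is sketched below. The text does not specify the bundling algorithm, so the greedy rule used here (start a new job whenever the next stage cannot follow the previous one in map, combine, reduce order) is an assumption, as are the function name and the string stage labels.

```python
from typing import List

# Position of each stage type inside a single MapReduce job.
ORDER = {"map": 0, "combine": 1, "reduce": 2}


def bundle_stages(stages: List[str]) -> List[List[str]]:
    """Greedily bundle an ordered list of stages into jobs.

    Each stage lands in exactly one job. A job that would otherwise start
    with a combine or reduce stage is given an identity map stage first.
    """
    jobs: List[List[str]] = []
    for stage in stages:
        if stage not in ORDER:
            raise ValueError(f"unknown stage type: {stage}")
        # Start a new job when this stage cannot come after the last stage
        # of the current job (for example a map following a reduce).
        if not jobs or ORDER[stage] <= ORDER[jobs[-1][-1]]:
            jobs.append([] if stage == "map" else ["identity_map"])
        jobs[-1].append(stage)
    return jobs


jobs = bundle_stages(["map", "reduce", "map", "combine", "reduce", "reduce"])
```

Here the six declared stages become three jobs: map and reduce; map, combine and reduce; and an identity map followed by the trailing reduce.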

System 100 features an organization of records in the key-value store 52 that groups records into batches to support the efficient update of these workflows, making a series of MapReduce analytics reusable, configurable, and secure.

The skill level required to create distributed analytical bulk processes is high, and therefore scarce. The parameterization of such processes enables them to be written once and reused many times by users otherwise unable to write these analytics on their own. The parameterization of these analytics, or transforms, involves providing developers with an API that includes a method for declaring which parts of the transform are configurable by an end user at run time. As illustrated in FIG. 6 and discussed below, this method of parameterization involves the MapReduce application being extended by adding an additional context object 31 that contains a set of parameters 50 that represent the configurable parts of a MapReduce job (transform) 32. Each parameter specifies its type (a line or more of text, a number, the name of a field, etc.), whether it is required, a default value, and a description. Parameters are then read by the system when a package of MapReduce code (usually a JAR file) is uploaded to the system. FIG. 6 illustrates how analytic code can be quickly applied to any schema of data at run time.

Parameters for a particular MapReduce job 32 are presented to users via a user interface (UI) 33 prior to launching. Users can configure which data record collections the job 32 should use as input, to which data record collection output records should be written, and any parameters specified by the job. Any required parameters that are not filled out, or parameters filled out with invalid values, will cause an attempt to launch the job to fail and prompt users to fill out the required parameters. This allows MapReduce code to apply to a variety of data record collections differing in record structure, and allows the same MapReduce code to be applied in different ways from job to job via differing parameter values.
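The declaration and launch-time checking of parameters could look roughly like the following. This is a hedged sketch rather than the system's API: the `Parameter` class, its attribute names, and the `resolve` function are hypothetical, and Python callables such as `str` and `int` stand in for the parameter types named in the text.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Parameter:
    """A configurable part of a transform, declared by its developer."""
    name: str
    type: Callable[[Any], Any]      # converter, e.g. str or int (assumed)
    required: bool = False
    default: Optional[Any] = None
    description: str = ""


def resolve(params: List[Parameter], supplied: Dict[str, Any]) -> Dict[str, Any]:
    """Validate user-supplied values; raise if the job must not launch."""
    errors, values = [], {}
    for p in params:
        raw = supplied.get(p.name, p.default)
        if raw is None:
            if p.required:
                errors.append(f"missing required parameter: {p.name}")
            continue
        try:
            values[p.name] = p.type(raw)
        except (TypeError, ValueError):
            errors.append(f"invalid value for {p.name}: {raw!r}")
    if errors:
        raise ValueError("; ".join(errors))
    return values


declared = [
    Parameter("field", str, required=True, description="field to aggregate"),
    Parameter("top_n", int, default=10, description="number of results"),
]
config = resolve(declared, {"field": "name"})
```

A launch attempt with a missing required value or an unparseable number raises an error listing every problem, which a UI could use to prompt the user.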

Code internal to the MapReduce job 32 then retrieves the user-configured values of these parameters from the management database 34 and uses them to modify the way processing of records is accomplished. Analytical transforms that execute as MapReduce jobs take advantage of this record structure by retrieving the necessary fields and processing their values. The set of fields that are processed can vary from execution to execution and are defined as parameters by the developer of the transform. These parameters are filled in by data workers who create specific configurations of these transforms to run over one or more sets of data records. This way data workers who have not written the transform can use the transform over and over in different contexts.

Further, the data records that are read by transform 32 may be filtered based on the combination of the security labels (attached to individual records) and the security tokens presented by the data worker scheduling the transform for execution. As transforms read over data records and produce new data records, the transform may add additional security labels as necessary, but after new records are passed to the system to be stored, the system applies a new clause to the security label (which may simply be empty) that includes a unique token identifying the set of output records. This unique token may be presented by any queries or transforms wishing to read the records. Initially no user except the data worker who created the collection may possess this token.

A discussion of efficiently maintaining representative samples of large data collections follows. Many analytical and statistical algorithms are designed to run on a sample of a data set. In a distributed system it can be difficult to maintain a representative sample without knowing how many data items have been ingested and how many data items exist in total.

FIG. 8 illustrates a data ingest and sampling architecture according to a preferred embodiment of the invention, and further shows external data passing through processing nodes and being written to specific tables in the key value store 52. The system maintains a representative sample 36 of a dataset by having each ingest node 35 generate a hash of each record and keep in memory a set of some number N of records whose hashes comprise the minimum N hash values for the duration of the ingest batch. As hashes of records are designed to be uniformly distributed throughout the hash space, the minimum N records constitute a representative sample of the records a single ingest node has processed.

At the end of the ingest batch, each ingest node writes its N sample records to the sample table 36 in a key-value store 52 that stores key-value pairs in sorted order by keys. In this table the key is the hash of the record and the value is the serialized record. The first N combined samples in this table now constitute a representative sample of the entire dataset processed thus far. In future ingest batches this table is updated with new hash values and the representative nature of the first N records in this table is preserved. Any records that sort after the first N records can be discarded, usually by some separate garbage collection process. The particular hash function used is not critical; an MD5 or an SHA based hash function will suffice, for example.
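The sampling scheme above can be sketched in a few lines. The sketch assumes MD5 (one of the hash functions the text mentions), represents the sorted sample table as a list of (hash, record) pairs, and trims the table immediately rather than through a separate garbage collection process; the function names are hypothetical.

```python
import hashlib
import heapq


def record_hash(record: bytes) -> str:
    """Hex digest used as the sample-table key (MD5 chosen for illustration)."""
    return hashlib.md5(record).hexdigest()


def node_sample(records, n):
    """Keep the n records with the smallest hashes seen by one ingest node."""
    return heapq.nsmallest(n, ((record_hash(r), r) for r in records))


def merge_into_table(table, node_samples, n):
    """Write node samples into the sorted sample table and keep the first n."""
    merged = dict(table)            # key = hash, value = serialized record
    for sample in node_samples:
        merged.update(sample)
    return sorted(merged.items())[:n]


records = [f"record-{i}".encode() for i in range(100)]
node_a = node_sample(records[:50], 5)   # first ingest node
node_b = node_sample(records[50:], 5)   # second ingest node
table = merge_into_table([], [node_a, node_b], 5)
```

The merged table equals the sample a single node would have produced over the whole dataset, since any record among the global N smallest hashes is necessarily among the N smallest on the node that ingested it.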

The actual data written to the sorted key-value store 52 consists of a collection name followed by the hash value as the key, with the original record stored in the value.

When retrieving a sample of records for a particular collection, a scan is begun at the first key that begins with the name of the collection and proceeds until the desired number of sample records is retrieved or until the sample records in the table are exhausted.

A discussion of schema-free ingest and indexing follows. To reduce the effort and complexity in importing data, particularly new and changing data, a schema-free ingest is required. Fundamentally, the key to building a schema-free ingest is to ensure that the entire ingest process is built such that it does not rely on the logical structure of the data to execute ingest. This means that the set of fields, and the types and lengths of values, do not have to be known or used by the system as part of ingest. Both values and fields are passed directly into the system. Collection-based meta-information is gathered as data is ingested and is stored in the relational database 3 to assist in developing indexing rules (which can be overridden by the user). Also, value-based meta-information (cardinality, presence, etc.) is collected on ingest and stored separately in the stats table 13 in the key value store 52 for quick retrieval.

This is implemented via some number of physical data format parsers designed to read fields and values. Each parser is designed to discover the set of fields and values within source data by parsing only the physical format of the data. Examples of physical formats include XML, CSV, JSON, relational database tables, etc. Any field names discovered are noted and stored with the values when imported into the system. The type of each value is discovered as part of the parsing process and the value and its type information are also stored in the system under the fields in which they were discovered. Optionally, some type detection algorithms can be applied to values to further distinguish types from common representations, such as string representations of dates or numbers. When a more specific type is found for a value, the value is stored associated with the more specific type. Fields may contain ‘simple’ values such as strings, numbers, etc., and may also contain ‘complex’ values consisting of lists of values or maps of fields to values. The system can handle an arbitrary nesting of fields and values.
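As an illustration of parsing by physical format with optional type detection, the sketch below reads a JSON record and refines string values that look like numbers or dates. The type names, the regular expressions, and the single ISO date format are assumptions chosen for the example; a real parser would likely recognize more representations.

```python
import json
import re
from datetime import datetime

INT_RE = re.compile(r"[+-]?\d+")
FLOAT_RE = re.compile(r"[+-]?\d+\.\d+")


def detect_type(value):
    """Return (type_name, typed_value), recursing into lists and maps."""
    if isinstance(value, bool):
        return "boolean", value
    if isinstance(value, (int, float)):
        return "number", value
    if isinstance(value, str):
        # Optional refinement of common string representations.
        if INT_RE.fullmatch(value):
            return "number", int(value)
        if FLOAT_RE.fullmatch(value):
            return "number", float(value)
        try:
            return "date", datetime.strptime(value, "%Y-%m-%d").date()
        except ValueError:
            return "string", value
    if isinstance(value, list):
        return "list", [detect_type(v) for v in value]
    if isinstance(value, dict):
        return "map", {k: detect_type(v) for k, v in value.items()}
    return "null", value


def parse_json_record(text):
    """Discover fields and typed values without any schema given up front."""
    return {field: detect_type(v) for field, v in json.loads(text).items()}


record = parse_json_record(
    '{"name": "Bob", "age": "42", "joined": "2014-06-13", "tags": ["a", "1.5"]}'
)
```

No field list is supplied in advance: the field names and value types come entirely from what the parser finds in the source text.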

Each unique value type is also indexed in such a way as to enable range scans over values when they are stored in a sorted key-value store that sorts values in lexicographical order, byte by byte. This requires that some types of values be transformed when stored in the index table, which consists of key-value pairs sorted by key. Each type has its own encoder to transform original values to appropriate representations in this index table 12. New types can be added to the index by creating new type encoders.
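To show why such encoders are needed, the sketch below encodes signed 64-bit integers so that byte-by-byte lexicographic order matches numeric order. This particular offset encoding is a common technique used here as an example; the text does not state which encodings the system uses, and the function names are hypothetical.

```python
import struct

OFFSET = 1 << 63  # shifts the signed 64-bit range onto 0 .. 2**64 - 1


def encode_int64(value: int) -> bytes:
    """Encode a signed 64-bit integer as 8 big-endian bytes whose
    lexicographic order equals the numeric order of the inputs.
    Values outside the signed 64-bit range raise struct.error."""
    return struct.pack(">Q", value + OFFSET)


def decode_int64(data: bytes) -> int:
    """Invert encode_int64."""
    return struct.unpack(">Q", data)[0] - OFFSET


values = [5, -1, 0, -300, 1 << 40]
encoded_sorted = sorted(encode_int64(v) for v in values)
round_trip = [decode_int64(b) for b in encoded_sorted]
```

Without the offset, the two's-complement bytes of negative numbers would sort after those of positive numbers, and a range scan such as -10 to 10 would not be a single contiguous run of keys.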

A discussion of optimizations around multi-dimensional range scans (composite indexes) follows. When indexing multi-dimensional data, such as latitude and longitude, techniques to map these multiple dimensions to one dimension can enable the efficient retrieval of data points falling in a multi-dimensional box. Use of Z-order curves to interleave values of a given record supports fast query on range scans across multiple values. In addition, this capability is provided via a simple user interface 1 and API calls, not requiring the user to do any development to create and maintain the indexes. These indexes can be built on two or more values in a collection, up to an arbitrary number, although the approach is optimized for a low number of fields (2-5). The Z-ordered curve is built from the interleaved bytes of each indexed value.
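The byte interleaving described above can be illustrated as follows. The sketch assumes each indexed value has already been encoded to a fixed-width, order-preserving byte string of the same length; that requirement and the function name are assumptions for the example.

```python
from itertools import chain


def interleave_bytes(*encoded: bytes) -> bytes:
    """Build a composite key by taking the first byte of every value,
    then the second byte of every value, and so on."""
    if len({len(e) for e in encoded}) != 1:
        raise ValueError("all encoded values must have the same width")
    return bytes(chain.from_iterable(zip(*encoded)))


# Two hypothetical 2-byte encoded dimensions, e.g. latitude and longitude.
lat = b"\x01\x02"
lon = b"\xa1\xa2"
composite = interleave_bytes(lat, lon)
```

Because the most significant bytes of every dimension come first in the composite key, points that are close in all dimensions tend to sort near each other. Many Z-order implementations interleave at the bit level for finer locality; the byte granularity here follows the text.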

A discussion of indexing multiple types of data from multiple collections for efficient cross-collection and cross-field queries follows. As illustrated in FIG. 11, an index structure may be created that interleaves the entries for values from multiple collections and multiple fields. For every value in a Record that gets indexed, at least two entries are written to the index table 37. In general, index entries consist of a search term 38 (with any collection- or field-specific information required 39, 40), which comprises the beginning portion of a key in a key-value pair, and a record ID 41 associated with that key (FIG. 2).

An entry that is not prefixed by the field name 42. For example, if the field name were ‘First name’ and the value ‘Bob’, we write an index entry as follows: $any_Bob. Note that the value ‘Bob’ appearing in any field will result in the same index entry.

An entry that is prefixed by the field name 43. For example, if the field name is ‘First name’ and the value is ‘Bob’, we write an index entry as follows: First name_Bob.

When querying without a field name 42, a key is constructed that prepends the ‘$any’ field to the beginning of the search term. This allows a user querying for ‘Bob’ in any field to find the index entry, $any_Bob, in the index.

When writing index entries, the collection ID 39 is prepended. So when writing an index entry for the field ‘First name’ and value ‘Bob’ appearing in the collection ‘Customers’, we write the index entry Customers_First name_Bob and also an entry Customers_$any_Bob.
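By way of non-limiting illustration only, the construction of the two index entries per value, and of the corresponding query key, may be sketched in Python as follows. The underscore-joined key layout follows the examples given above; the function names index_entries and query_key are hypothetical.

```python
ANY_FIELD = "$any"


def index_entries(collection_id, record_id, record):
    """Return (key, record_id) pairs: one field-specific and one $any entry per value."""
    entries = []
    for field, value in record.items():
        entries.append((f"{collection_id}_{field}_{value}", record_id))
        entries.append((f"{collection_id}_{ANY_FIELD}_{value}", record_id))
    return entries


def query_key(collection_id, term, field=None):
    """Build the lookup key; when no field is named, the $any field is used."""
    return f"{collection_id}_{field or ANY_FIELD}_{term}"
```

For the record {'First name': 'Bob'} in the collection ‘Customers’, this sketch produces the keys Customers_First name_Bob and Customers_$any_Bob, matching the entries described above.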

Using a single index table for multiple collections. We store all index entries created from any data record collection in the same index table 12 in a sorted key-value store. When querying multiple collections at once, we create keys for each collection specified by the user and query for terms in each collection. Queries consist of a scan over the set of key-value pairs beginning with a collection ID, field information, and search term to retrieve a set of record IDs. The full records identified with each record ID are then typically retrieved by doing additional lookups on a separate table containing all the full records as values, and record IDs as keys.
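By way of non-limiting illustration only, a query across several collections against a single shared index table may be sketched in Python as follows. A sorted list of (key, record ID) tuples stands in for the sorted key-value store, and a dictionary stands in for the separate record table; both, along with the names scan and query, are assumptions made for this sketch.

```python
from bisect import bisect_left


def scan(index, key):
    """Return the record IDs of every entry whose key equals `key`.

    `index` is a list of (key, record_id) tuples sorted by key, so all
    matching entries are adjacent and can be read in a single pass.
    """
    position = bisect_left(index, (key,))
    record_ids = []
    while position < len(index) and index[position][0] == key:
        record_ids.append(index[position][1])
        position += 1
    return record_ids


def query(index, records, collection_ids, term, field="$any"):
    """Build one key per collection, scan the index, then look up the full records."""
    results = []
    for collection_id in collection_ids:
        for record_id in scan(index, f"{collection_id}_{field}_{term}"):
            results.append(records[record_id])  # second lookup, keyed by record ID
    return results
```

The same index serves every collection because the collection ID is the leading portion of each key, so entries for one collection are contiguous in sort order.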

A discussion of index structures that don't include the field name to enable query across fields follows. Both Role Based Access Control (RBAC) and Mandatory Access Control (MAC) offer users powerful ways of protecting data from unauthorized access. Many systems offer only one method or the other. FIG. 7 illustrates the implementation of record level Mandatory Access Control (MAC) with collection level Role Based Access Control (RBAC) security models, a system for controlling access between users, groups, and data record collections using access token expressions. A preferred embodiment includes the ability to determine the access tokens 44 of the user 45 importing records and construct Boolean expressions of these tokens for each field of each record that the user stores. The access tokens that a user possesses are determined by the user's security specification and the security specification of any groups of which the user is a member. Note that the possession of tokens by a user can be manipulated at any time to grant or revoke access.

A data record collection 47 consists of some set of data records 48 stored in the system. Each record within a data record collection is labeled 49 upon import into the system with a unique generated token, specific to the one particular data record collection to which it belongs.

A security label expression 62 consists of one or more tokens separated by the Boolean operators & and |, which describe which tokens the system should require. Two tokens or sub-expressions separated by & are both required. Only one of two tokens or sub-expressions is required when they are separated by |. In addition, parentheses ( ) can be used to group sub-expressions.
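By way of non-limiting illustration only, evaluation of a security label expression against the set of tokens a user possesses may be sketched in Python as follows. The disclosure does not state a precedence between & and |; this sketch assumes that & binds more tightly than | when no parentheses are given. The function name satisfies and the token spellings used with it are hypothetical.

```python
import re


def satisfies(expression, user_tokens):
    """Return True if `user_tokens` satisfies the label `expression`.

    Grammar assumed for this sketch:
        or_expr  := and_expr ('|' and_expr)*
        and_expr := atom ('&' atom)*
        atom     := TOKEN | '(' or_expr ')'
    """
    items = re.findall(r"[&|()]|[^&|()\s]+", expression)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(items) and items[pos] == "|":
            pos += 1
            right = parse_and()  # always parse the right side before combining
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(items) and items[pos] == "&":
            pos += 1
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(items):
            raise ValueError("unexpected end of expression")
        item = items[pos]
        pos += 1
        if item == "(":
            result = parse_or()
            if pos >= len(items) or items[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if item in ("&", "|", ")"):
            raise ValueError(f"unexpected {item!r}")
        return item in user_tokens

    result = parse_or()
    if pos != len(items):
        raise ValueError("unexpected trailing input")
    return result
```

For example, under these assumptions a user holding only the tokens coll42 and ops satisfies the expression coll42 & (secret | ops) but not the expression coll42 & secret.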

Some records may have security expressions which can be parsed by the system and added to the final security label expression that is stored with the record. In this case, a user 45 must possess both the token 44 specific to the data record collection and any additional tokens required by the additional security expression 62 of a particular record in order to read the record. The key value store carries out this comparison 63.

Initially, only the importer of external data 64 into a data record collection 47 possesses the token 44 required to read records from this data record collection. The token to read this data record collection can be assigned to additional groups consisting of zero or more users in order to grant those users the ability to read records from this data record collection.

Each time a user queries the system 100, the set of tokens assigned to any group to which a user belongs is retrieved and passed to the underlying key-value store, which uses these tokens to attempt to satisfy the security label expression stored on each record, and the system filters out any record whose security label expression is not satisfied by the set of tokens the user possesses.
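By way of non-limiting illustration only, the query-time flow of gathering a user's tokens from group memberships and filtering records may be sketched in Python as follows. For brevity the sketch assumes that each security label expression has already been parsed into either a single token string or a nested tuple of the form ('and' or 'or', left, right); this representation, the dictionaries standing in for the administrative data, and the names user_tokens, satisfied, and visible_records are hypothetical.

```python
def user_tokens(user, memberships, group_tokens):
    """Union of the tokens assigned to every group to which the user belongs."""
    tokens = set()
    for group in memberships.get(user, ()):
        tokens |= group_tokens.get(group, set())
    return tokens


def satisfied(label, tokens):
    """Evaluate a pre-parsed label: a token string or ('and'|'or', left, right)."""
    if isinstance(label, str):
        return label in tokens
    operator, left, right = label
    if operator == "and":
        return satisfied(left, tokens) and satisfied(right, tokens)
    if operator == "or":
        return satisfied(left, tokens) or satisfied(right, tokens)
    raise ValueError(f"unknown operator {operator!r}")


def visible_records(records, tokens):
    """Keep only records whose label is satisfied; `records` holds (label, body) pairs."""
    return [body for label, body in records if satisfied(label, tokens)]
```

A user who belongs to no group holds no tokens in this sketch and therefore sees no records, which reflects the behavior described above in which access is granted only by assigning a collection's token to a group.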

A client that allows users to query this system must authenticate users, retrieve an authenticated user's authorization tokens from an external system, and pass those tokens faithfully on to the read API of the key-value store 2, which in turn uses those tokens to filter records.

In accordance with a preferred embodiment of the invention, the system is implemented by using security label parsers 49 to parse any security information found on external records when they are imported into the system and translating that information into security label expressions 62 that the key-value store understands, indicating that a user must possess whatever tokens are required to satisfy the record-specific security label expression in order to read the data.

To establish the veracity of analytical results, examining the provenance of data as it is processed from the original source through analytical transforms is important. Whenever a MapReduce transform 28, 27, 30 is run using one or more collections 23, 25, 28 as input, the system maintains a list of batches within each input collection that have already been processed, and the IDs 20 of batches are generated and written as output to a new collection.

Preferably, every transform launched will cause some records to be written to a relational database 3 containing the names or IDs of data record collections read as input, the name of the bundle of MapReduce code used, and the configuration options of the particular job. This record is all that is required to reconstruct a history of how a data record collection has been used to create derivative collections, and also to reconstruct the history of what processes were involved in creating a particular data record collection, via simple SQL queries against the relational database. In addition, batch-level information stored in these records can help identify which batches from input collections contributed to which batch of an output collection.
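By way of non-limiting illustration only, the recording of transform jobs in a relational database and the reconstruction of provenance by SQL query may be sketched in Python using the standard sqlite3 module as follows. The table names, column names, and function names are hypothetical, and SQLite merely stands in for whatever relational database an embodiment uses.

```python
import sqlite3


def create_schema(conn):
    """Create two hypothetical tables: one row per job, one row per job input."""
    conn.executescript("""
        CREATE TABLE transform_jobs (
            job_id INTEGER PRIMARY KEY,
            code_bundle TEXT,
            config TEXT,
            output_collection TEXT
        );
        CREATE TABLE job_inputs (
            job_id INTEGER REFERENCES transform_jobs(job_id),
            input_collection TEXT
        );
    """)


def record_job(conn, code_bundle, config, inputs, output):
    """Write the record produced when a transform is launched."""
    cursor = conn.execute(
        "INSERT INTO transform_jobs (code_bundle, config, output_collection) "
        "VALUES (?, ?, ?)", (code_bundle, config, output))
    conn.executemany("INSERT INTO job_inputs VALUES (?, ?)",
                     [(cursor.lastrowid, name) for name in inputs])
    return cursor.lastrowid


def derivatives(conn, collection):
    """Collections created directly from `collection`."""
    rows = conn.execute(
        "SELECT j.output_collection FROM job_inputs i "
        "JOIN transform_jobs j ON j.job_id = i.job_id "
        "WHERE i.input_collection = ? ORDER BY j.output_collection",
        (collection,))
    return [row[0] for row in rows]


def lineage(conn, collection):
    """Every upstream collection that contributed to `collection`, at any depth."""
    rows = conn.execute("""
        WITH RECURSIVE upstream(name) AS (
            SELECT i.input_collection
            FROM transform_jobs j JOIN job_inputs i ON i.job_id = j.job_id
            WHERE j.output_collection = ?
            UNION
            SELECT i.input_collection
            FROM upstream u
            JOIN transform_jobs j ON j.output_collection = u.name
            JOIN job_inputs i ON i.job_id = j.job_id
        )
        SELECT name FROM upstream ORDER BY name
    """, (collection,))
    return [row[0] for row in rows]
```

Batch-level provenance could be modeled in the same way by adding batch identifier columns to these tables; that extension is omitted from the sketch.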

While the preferred embodiment of the invention has been illustrated and described, as noted above, many changes can be made without departing from the spirit and scope of the invention. Accordingly, the scope of the invention is not limited by the disclosure of the preferred embodiment. Instead, the invention should be determined entirely by reference to the claims that follow.

We claim:
 1. A method for efficiently importing and indexing a plurality of data sets, each data set of the plurality having a schema different from every other data set of the plurality, each data set of the plurality comprising at least one field containing at least one value, the at least one value having a type and length, the method comprising: automatically identifying the physical format of each data set of the plurality; for each data set of the plurality, automatically identifying the at least one field and at least one value; for each data set of the plurality, automatically identifying a name of the at least one field; and storing each value indexed to the name of the field containing each said value in a data set of the plurality.
 2. The method of claim 1, further comprising for each data set of the plurality, automatically identifying the type of the at least one value.
 3. The method of claim 2, further comprising storing each type of the at least one value indexed to the name of the field containing each said value in a data set of the plurality.
 4. The method of claim 2, wherein the type comprises a numerical string.
 5. The method of claim 1, wherein each data set of the plurality is in a file format of a plurality of file formats, and the at least one field and at least one value of each data set of the plurality are automatically identified based on the file format of each said data set.
 6. The method of claim 2, wherein each data set of the plurality is in a file format of a plurality of file formats, and the type of the at least one value of each data set of the plurality is automatically identified based on the file format of each said data set.
 7. The method of claim 1, wherein each value is stored in a sorted key-value store.
 8. At least one computer-readable medium on which are stored instructions that, when executed by at least one processing device, enable the at least one processing device to perform a method, comprising the steps of: automatically identifying the physical format of each data set of the plurality; for each data set of the plurality, automatically identifying the at least one field and at least one value; for each data set of the plurality, automatically identifying a name of the at least one field; and storing each value indexed to the name of the field containing each said value in a data set of the plurality.
 9. The method of claim 8, further comprising for each data set of the plurality, automatically identifying the type of the at least one value.
 10. The method of claim 9, further comprising storing each type of the at least one value indexed to the name of the field containing each said value in a data set of the plurality.
 11. The method of claim 9, wherein the type comprises a numerical string.
 12. The method of claim 8, wherein each data set of the plurality is in a file format of a plurality of file formats, and the at least one field and at least one value of each data set of the plurality are automatically identified based on the file format of each said data set.
 13. The method of claim 9, wherein each data set of the plurality is in a file format of a plurality of file formats, and the type of the at least one value of each data set of the plurality is automatically identified based on the file format of each said data set.
 14. The method of claim 8, wherein each value is stored in a sorted key-value store.